A person makes a doctor's appointment, receives all the instructions, and then does not show up. Who is to blame? This dataset collects information from 110,527 medical appointments in Brazil and focuses on whether or not patients show up for their appointment. Each row includes a number of characteristics about the patient.
No-show: Yes or No. 'No' means the patient attended the appointment; 'Yes' means the patient missed it.
Scholarship indicates whether or not the patient is enrolled in the Brazilian welfare program Bolsa Família.
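1) Does Gender affect the attendance?
2) Does Alcoholism affect the attendance?
3) Does receiving an SMS affect the attendance?
4) What is the distribution of Ages according to the attendance?
5) Does the Neighbourhood affect the attendance?
6) How do Age and Neighbourhood affect the attendance together?
7) What is the most and the least chosen AppointmentDay in the week according to the attendance?
8) What is the oldest Age recorded in the database?
9) What is the central tendency of the Age?
10) What is the measure of spread for the Age?
11) Which 5 Neighbourhoods have the fewest patients?
import pandas as pd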
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
#pd.set_option('display.max_rows', 100)
df = pd.read_csv('NoShowApp2016.csv')
df.head(5)
#df.sample(100)
| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.987250e+13 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 5.589978e+14 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4.262962e+12 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 8.679512e+11 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8.841186e+12 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |
Removing unnecessary Columns.
df.drop(columns = ['ScheduledDay'], inplace = True)
df.head(2)
| | PatientId | AppointmentID | Gender | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.987250e+13 | 5642903 | F | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 5.589978e+14 | 5642503 | M | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
Converting the AppointmentDay column to a datetime format and deriving the day of the week from it.
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'])
#df.AppointmentDay.dtype
df['Day'] = df['AppointmentDay'].dt.day_name()
df.head(2)
| | PatientId | AppointmentID | Gender | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show | Day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.987250e+13 | 5642903 | F | 2016-04-29 00:00:00+00:00 | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No | Friday |
| 1 | 5.589978e+14 | 5642503 | M | 2016-04-29 00:00:00+00:00 | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No | Friday |
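The weekday derivation above can be checked on a couple of hand-made timestamps. This is only a small sketch: the `sample` frame and its two dates are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample in the same ISO 8601 format as the AppointmentDay column.
sample = pd.DataFrame({'AppointmentDay': ['2016-04-29T00:00:00Z', '2016-05-03T00:00:00Z']})

# Parse the strings into timezone-aware timestamps, then derive the weekday name.
sample['AppointmentDay'] = pd.to_datetime(sample['AppointmentDay'])
sample['Day'] = sample['AppointmentDay'].dt.day_name()

print(sample['Day'].tolist())
```

The first date is the one shown in the table above, so it should come out as Friday, matching the `Day` column.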
Fixing Some Columns' Names.
df.rename({'Hipertension' : 'Hypertension', 'Handcap' : 'Handicap', 'No-show' : 'Missed'}, axis = 'columns', inplace=True)
df.head(2)
| | PatientId | AppointmentID | Gender | AppointmentDay | Age | Neighbourhood | Scholarship | Hypertension | Diabetes | Alcoholism | Handicap | SMS_received | Missed | Day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.987250e+13 | 5642903 | F | 2016-04-29 00:00:00+00:00 | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No | Friday |
| 1 | 5.589978e+14 | 5642503 | M | 2016-04-29 00:00:00+00:00 | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No | Friday |
Awesome, this is now much better.
Now, let's get more information about our dataset.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64
 2   Gender          110527 non-null  object
 3   AppointmentDay  110527 non-null  datetime64[ns, UTC]
 4   Age             110527 non-null  int64
 5   Neighbourhood   110527 non-null  object
 6   Scholarship     110527 non-null  int64
 7   Hypertension    110527 non-null  int64
 8   Diabetes        110527 non-null  int64
 9   Alcoholism      110527 non-null  int64
 10  Handicap        110527 non-null  int64
 11  SMS_received    110527 non-null  int64
 12  Missed          110527 non-null  object
 13  Day             110527 non-null  object
dtypes: datetime64[ns, UTC](1), float64(1), int64(8), object(4)
memory usage: 11.8+ MB
# scatter_matrix seems to hang my Python interpreter; it is probably just very slow on ~110k rows, so it stays disabled.
#pd.plotting.scatter_matrix(df);
fig = ['Gender', 'Age', 'Neighbourhood', 'Scholarship', 'Hypertension', 'Diabetes', 'Alcoholism', 'Handicap', 'SMS_received', 'Missed']
df.hist(column = fig, figsize = (17, 17), color = 'C1');
#df.Gender.value_counts().plot(kind = 'bar');
x = df.Gender.value_counts()    # counts per gender, most frequent first
y = x.index.to_numpy()          # labels taken from the counts, so each bar gets its own label
print(y)
print(type(y))
print(x)
print(type(x))
plt.barh(y, x)
plt.gcf().set_size_inches(6, 2)
for index, value in enumerate(x):
    plt.text(value, index, str(value))
plt.show()
# index by label ('F', 'M') instead of by position
print("We can see the ratio between the two Genders is equal to {:.2f} F/M.".format(x['F'] / x['M']))
['F' 'M']
<class 'numpy.ndarray'>
F    71840
M    38687
Name: Gender, dtype: int64
<class 'pandas.core.series.Series'>
We can see the ratio between the two Genders is equal to 1.86 F/M.
#df.Missed.value_counts().plot(kind = 'barh', figsize = (6,2));
x = df.Missed.value_counts()    # 'No' = attended, 'Yes' = missed
y = x.index                     # labels in the same order as the counts
plt.barh(y, x)
plt.gcf().set_size_inches(6,2)
for index, value in enumerate(x):
    plt.text(value, index, str(value))
plt.show()
# index by label ('No', 'Yes') instead of by position
print("We can see the ratio between those who attended and those who did not is equal to {:.2f} Attended/Missed.".format(x['No'] / x['Yes']))
We can see the ratio between those who attended and those who did not is equal to 3.95 Attended/Missed.
df.Neighbourhood.value_counts().plot(kind = 'barh', figsize = (10,15), color = 'g');
df.describe()
| | PatientId | AppointmentID | Age | Scholarship | Hypertension | Diabetes | Alcoholism | Handicap | SMS_received |
|---|---|---|---|---|---|---|---|---|---|
| count | 1.105270e+05 | 1.105270e+05 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 |
| mean | 1.474963e+14 | 5.675305e+06 | 37.088874 | 0.098266 | 0.197246 | 0.071865 | 0.030400 | 0.022248 | 0.321026 |
| std | 2.560949e+14 | 7.129575e+04 | 23.110205 | 0.297675 | 0.397921 | 0.258265 | 0.171686 | 0.161543 | 0.466873 |
| min | 3.921784e+04 | 5.030230e+06 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 4.172614e+12 | 5.640286e+06 | 18.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 3.173184e+13 | 5.680573e+06 | 37.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 9.439172e+13 | 5.725524e+06 | 55.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| max | 9.999816e+14 | 5.790484e+06 | 115.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 1.000000 |
df.Handicap.value_counts()
0    108286
1      2042
2       183
3        13
4         3
Name: Handicap, dtype: int64
The minimum Age is -1, which does not make any sense! It should be changed in the data cleaning section.
The Handicap column has values from 0 to 4. It refers to the number of disabilities that the patient has, so nothing needs to be changed.
Let's check for duplicates.
print('Number of duplicated Rows: {}'.format(df.duplicated().sum()))
print('Number of duplications in PatientID column: {}'.format(df.PatientId.duplicated().sum()))
print('Number of duplications in AppointmentID column: {}'.format(df.AppointmentID.duplicated().sum()))
Number of duplicated Rows: 0
Number of duplications in PatientID column: 48228
Number of duplications in AppointmentID column: 0
We can see that:
No Duplicated Rows in the dataset.
PatientID has 48228 duplicates.
AppointmentID has no duplicates.
That makes sense, because the same patient can have several appointments.
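To see what the duplicate count means in terms of patients, here is a small sketch on made-up data. The `appts` frame is hypothetical; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical appointments: patient 11 appears three times, patient 33 twice, patient 22 once.
appts = pd.DataFrame({
    'PatientId': [11, 22, 11, 33, 11, 33],
    'AppointmentID': [1, 2, 3, 4, 5, 6],
})

# duplicated() marks every repeat after the first occurrence of a value.
n_repeat_rows = appts['PatientId'].duplicated().sum()

# Number of distinct patients, and how many appointments each one has.
n_patients = appts['PatientId'].nunique()
per_patient = appts.groupby('PatientId')['AppointmentID'].count()

print(n_repeat_rows, n_patients)
print(per_patient)
```

Since every row is either a first occurrence or a repeat, the number of rows equals distinct patients plus repeats. Applied to the counts above, that would imply 110527 - 48228 = 62299 distinct patients.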
Let's now fix that wrong value in the Age column by replacing -1 with 1.
df.loc[df.Age == -1,'Age'] = 1
df.describe()
| | PatientId | AppointmentID | Age | Scholarship | Hypertension | Diabetes | Alcoholism | Handicap | SMS_received |
|---|---|---|---|---|---|---|---|---|---|
| count | 1.105270e+05 | 1.105270e+05 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 |
| mean | 1.474963e+14 | 5.675305e+06 | 37.088892 | 0.098266 | 0.197246 | 0.071865 | 0.030400 | 0.022248 | 0.321026 |
| std | 2.560949e+14 | 7.129575e+04 | 23.110176 | 0.297675 | 0.397921 | 0.258265 | 0.171686 | 0.161543 | 0.466873 |
| min | 3.921784e+04 | 5.030230e+06 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 4.172614e+12 | 5.640286e+06 | 18.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 3.173184e+13 | 5.680573e+06 | 37.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 9.439172e+13 | 5.725524e+06 | 55.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| max | 9.999816e+14 | 5.790484e+06 | 115.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 1.000000 |
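Overwriting the invalid Age is one way to clean it. As a comparison, the sketch below shows the same replacement next to an alternative that drops the invalid row instead. The `ages` frame is made up, and the drop variant is not what this notebook does.

```python
import pandas as pd

# Hypothetical ages, including one impossible negative value.
ages = pd.DataFrame({'Age': [62, -1, 0, 8]})

# Option 1 (the approach used above): overwrite the invalid value.
replaced = ages.copy()
replaced.loc[replaced['Age'] == -1, 'Age'] = 1

# Option 2 (an alternative, not used in this notebook): drop rows with a negative age.
dropped = ages[ages['Age'] >= 0].reset_index(drop=True)

print(replaced['Age'].tolist())
print(dropped['Age'].tolist())
```

With a single bad row out of 110527, either choice should barely change the summary statistics.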
Great job so far.
Our main goal is to find out why patients miss their appointments.
Let us divide our patients into two groups: those who attended and those who did not.
attended = df[df.Missed == 'No']
not_attended = df[df.Missed == 'Yes']
print('Attended Patients Number: {}'.format(attended.shape[0]))
print('Not-Attended Patients Number: {}'.format(not_attended.shape[0]))
print('Ratio between those who attended and those who did not attend: {:.2f} (A/N)'.format(attended.shape[0]/not_attended.shape[0]))
# print("Attended {}:\n{}".format(type(attended), attended))
# print("Not Attended {}:\n{}".format(type(not_attended), not_attended))
Attended Patients Number: 88208
Not-Attended Patients Number: 22319
Ratio between those who attended and those who did not attend: 3.95 (A/N)
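The A/N ratio can also be read as a no-show rate, which is sometimes easier to compare between groups. A minimal sketch, using the two counts printed above; the helper names `no_show_rate` and `ratio_to_rate` are my own and not part of the notebook.

```python
def no_show_rate(attended_count, missed_count):
    """Share of appointments that were missed."""
    return missed_count / (attended_count + missed_count)


def ratio_to_rate(a_over_n):
    """Convert an Attended/Not-attended ratio into a no-show rate."""
    return 1 / (1 + a_over_n)


# Counts printed above.
rate = no_show_rate(88208, 22319)
print('{:.3f}'.format(rate))

# The same rate, recovered from the A/N ratio.
print('{:.3f}'.format(ratio_to_rate(88208 / 22319)))
```

So a ratio of about 3.95 corresponds to roughly one missed appointment in five.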
x_labels = df.Gender.unique() # x-axis labels
#print(x_labels, len(x_labels))
# reindex by x_labels so each count lines up with its label, whatever order value_counts returns
gen_attended_val = attended.Gender.value_counts().reindex(x_labels) # attended y-axis value
gen_not_attended_val = not_attended.Gender.value_counts().reindex(x_labels) # not-attended y-axis value
x = np.arange(len(x_labels)) # our labels' locations
width = 0.4 # bar width
fig, ax = plt.subplots(figsize = (9,6))
rec1 = ax.bar(x - width/2, gen_attended_val, width, label = 'Attended', color = 'm')
rec2 = ax.bar(x + width/2, gen_not_attended_val, width, label = 'Not-Attended', color = '0.35')
# add some text to the fig
ax.set_ylabel('Number Of Patients')
ax.set_title('Relation between Gender and Attendance')
ax.set_xticks(x, labels = x_labels) # passing labels here needs matplotlib 3.5 or later; otherwise remove it and uncomment the next line
#ax.set_xticklabels(x_labels)
ax.legend()
ax.bar_label(rec1, padding = 3)
ax.bar_label(rec2, padding = 3)
fig.tight_layout()
plt.show()
# index by label ('M', 'F') instead of by position
print("The Ratio between who Attended and who did not attend given that the patient is Male {:.2f} (A/N)".format(gen_attended_val['M']/gen_not_attended_val['M']))
print("The Ratio between who Attended and who did not attend given that the patient is Female {:.2f} (A/N)".format(gen_attended_val['F']/gen_not_attended_val['F']))
# fig, ax = plt.subplots()
# ax.hist(attended.Gender, label = 'Attended')
# ax.hist(not_attended.Gender, label = 'Not-Attended')
# ax.set_title("Releation between Gender and Attendance")
# ax.set_xlabel("Gender Type")
# ax.set_ylabel("Number Of Patients")
# ax.legend(loc = "upper right")
# plt.show();
The Ratio between who Attended and who did not attend given that the patient is Male 4.01 (A/N)
The Ratio between who Attended and who did not attend given that the patient is Female 3.92 (A/N)
The previous solution, rewritten as a function
def poltSideBySideBars(x_labels, bar1_val, bar2_val, bar1_name, bar2_name, y_label, title, xfig_size = 9, yfig_size = 6):
    x = np.arange(len(x_labels)) # our labels' locations
    width = 0.4 # bar width
    fig, ax = plt.subplots(figsize = (xfig_size, yfig_size))
    rec1 = ax.bar(x - width/2, bar1_val, width, label = bar1_name)
    rec2 = ax.bar(x + width/2, bar2_val, width, label = bar2_name)
    ax.set_ylabel(y_label)
    ax.set_title(title)
    ax.set_xticks(x, labels = x_labels) # passing labels here needs matplotlib 3.5 or later; otherwise remove it and uncomment the next line
    #ax.set_xticklabels(x_labels)
    ax.legend()
    ax.bar_label(rec1, padding = 3)
    ax.bar_label(rec2, padding = 3)
    fig.tight_layout()
    plt.show()
x_labels = df.Gender.unique() # x-axis labels
# reindex by x_labels so each count lines up with its label
gen_attended_val = attended.Gender.value_counts().reindex(x_labels) # attended y-axis value
gen_not_attended_val = not_attended.Gender.value_counts().reindex(x_labels) # not-attended y-axis value
# calling the function ...
poltSideBySideBars(x_labels, gen_attended_val, gen_not_attended_val, 'Attended',
                   'Not-Attended', 'Number Of Patients', 'Relation between Gender and Attendance')
print("The Ratio between who Attended and who did not attend given that the patient is Male {:.2f} (A/N)".format(gen_attended_val['M']/gen_not_attended_val['M']))
print("The Ratio between who Attended and who did not attend given that the patient is Female {:.2f} (A/N)".format(gen_attended_val['F']/gen_not_attended_val['F']))
The Ratio between who Attended and who did not attend given that the patient is Male 4.01 (A/N)
The Ratio between who Attended and who did not attend given that the patient is Female 3.92 (A/N)
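The same A/N ratio can be computed for every category of a column in one step with a cross-tabulation. This is a sketch on made-up rows: `attendance_ratio` is a hypothetical helper, and it assumes the target column holds the 'No'/'Yes' values used in `Missed`.

```python
import pandas as pd


def attendance_ratio(data, column, target='Missed'):
    """Attended / not-attended ratio for each value of `column`."""
    counts = pd.crosstab(data[column], data[target])
    return counts['No'] / counts['Yes']


# Hypothetical rows, only to show the shape of the result.
toy = pd.DataFrame({
    'Gender': ['F', 'F', 'F', 'M', 'M', 'M', 'M', 'F'],
    'Missed': ['No', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No'],
})
ratios = attendance_ratio(toy, 'Gender')
print(ratios)
```

Because the result is indexed by the category label, `ratios['F']` and `ratios['M']` do not depend on the order in which the categories happen to be sorted.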
Plotting the same comparison with Seaborn.
sns.set_style('dark')
plt.figure(figsize=(10,7))
ax = sns.countplot(data = df, x = 'Gender', hue = 'Missed', palette=["teal","yellow"])
ax.set_title('Relation between Gender and Attendance')
plt.show()
x_labels = ['No Alcoholism', 'Alcoholism']
# sort_index() puts 0 (no alcoholism) first and 1 (alcoholism) second, matching x_labels
alc_attended_val = attended.Alcoholism.value_counts().sort_index()
alc_not_attended_val = not_attended.Alcoholism.value_counts().sort_index()
# print(alc_attended_val)
# print(alc_not_attended_val)
# calling the function ...
poltSideBySideBars(x_labels, alc_attended_val, alc_not_attended_val, 'Attended',
                   'Not-Attended', 'Number Of Patients', 'The effect of Alcoholism on Attendance', 6, 7)
print("The Ratio between who Attended and who did not attend given that Alcoholism {:.2f} (A/N)".format(alc_attended_val[1]/alc_not_attended_val[1]))
print("The Ratio between who Attended and who did not attend given that NO Alcoholism {:.2f} (A/N)".format(alc_attended_val[0]/alc_not_attended_val[0]))
The Ratio between who Attended and who did not attend given that Alcoholism 3.96 (A/N)
The Ratio between who Attended and who did not attend given that NO Alcoholism 3.95 (A/N)
x_labels = ['Did not receive SMS', 'Received SMS']
# sort_index() puts 0 (no SMS) first and 1 (SMS received) second, matching x_labels
sms_attended_val = attended.SMS_received.value_counts().sort_index()
sms_not_attended_val = not_attended.SMS_received.value_counts().sort_index()
# print(sms_attended_val)
# print(sms_not_attended_val)
# calling the function ...
poltSideBySideBars(x_labels, sms_attended_val, sms_not_attended_val, 'Attended',
                   'Not-Attended', 'Number Of Patients', 'How receiving SMS can affect the Attendance', 6, 7)
print("The Ratio between who Attended and who did not attend given that SMS Received {:.2f} (A/N)".format(sms_attended_val[1]/sms_not_attended_val[1]))
print("The Ratio between who Attended and who did not attend given that NO SMS Received {:.2f} (A/N)".format(sms_attended_val[0]/sms_not_attended_val[0]))
The Ratio between who Attended and who did not attend given that SMS Received 2.63 (A/N)
The Ratio between who Attended and who did not attend given that NO SMS Received 4.99 (A/N)
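Another way to look at the SMS question is the share of missed appointments within each group, rather than the A/N ratio. A small sketch, assuming the same `SMS_received` and `Missed` column names; the rows are made up.

```python
import pandas as pd

# Hypothetical rows: four appointments without an SMS, four with one.
toy = pd.DataFrame({
    'SMS_received': [0, 0, 0, 0, 1, 1, 1, 1],
    'Missed': ['No', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No'],
})

# The mean of a boolean column is the fraction of True values,
# so grouping by SMS_received gives the no-show rate per group.
missed_flag = toy['Missed'].eq('Yes')
rate_by_sms = missed_flag.groupby(toy['SMS_received']).mean()
print(rate_by_sms)
```

A lower A/N ratio corresponds to a higher no-show rate, so the two views should always agree on which group misses more often.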
fig, ax = plt.subplots(figsize = (10,9))
ax.hist(attended.Age, alpha = 0.6, label = 'Attended', color = 'c')
ax.hist(not_attended.Age, alpha = 0.6, label = 'Not-Attended', color = 'r')
ax.set_title('Distributions of Ages according to the Attendance')
ax.set_xlabel('Ages of Patients')
ax.set_ylabel('Number of Patients')
ax.legend()
plt.show()
The previous solution, rewritten as a function
def plotTwoHist(column1, column2, col1_label, col2_label, title, x_label, y_label, xfig_size = 10, yfig_size = 9):
    fig, ax = plt.subplots(figsize = (xfig_size, yfig_size))
    # use the labels passed in by the caller instead of hard-coded ones
    ax.hist(column1, alpha = 0.6, label = col1_label)
    ax.hist(column2, alpha = 0.6, label = col2_label)
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.legend()
    plt.show()
col1 = attended.Age
col2 = not_attended.Age
# calling the function ...
plotTwoHist(col1, col2, 'Attended', 'Not-Attended',
'Distributions of Ages according to the Attendance', 'Ages of Patients', 'Number of Patients')
Plotting the same comparison as a boxplot with Seaborn.
plt.figure(figsize=(10,7))
ax = sns.boxplot(data = df, x = 'Missed', y= 'Age', hue = 'Missed', palette=["teal","yellow"])
plt.show()
print('The boxplot gives another view of how Age is distributed in the two groups.')
The boxplot gives another view of how Age is distributed in the two groups.
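The histograms and the boxplot compare raw counts and distributions. To compare no-show rates between age ranges directly, the ages can be grouped into bands first. The sketch below uses made-up rows, and the band edges are an arbitrary choice of mine, not something taken from the analysis above.

```python
import pandas as pd

# Hypothetical rows with the same column names as the dataset.
toy = pd.DataFrame({
    'Age': [3, 9, 15, 30, 50, 52, 70, 88],
    'Missed': ['No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'No'],
})

# Assumed band edges; right=False makes each band include its lower edge only.
bins = [0, 18, 45, 65, 120]
labels = ['0-17', '18-44', '45-64', '65+']
toy['AgeBand'] = pd.cut(toy['Age'], bins=bins, labels=labels, right=False)

# Share of missed appointments within each age band.
rate_by_band = toy['Missed'].eq('Yes').groupby(toy['AgeBand'], observed=False).mean()
print(rate_by_band)
```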
fig, ax = plt.subplots(figsize = (9,17))
# use one shared neighbourhood order for both groups; otherwise the overlaid bars are drawn under the wrong labels
order = df.Neighbourhood.value_counts().index
attended.Neighbourhood.value_counts().reindex(order, fill_value = 0).plot(kind = 'barh', ax = ax, alpha = 0.6, label = 'Attended', color = 'm')
not_attended.Neighbourhood.value_counts().reindex(order, fill_value = 0).plot(kind = 'barh', ax = ax, alpha = 0.6, label = 'Not-Attended', color = 'g')
ax.set_title('Distributions of Neighbourhood counts according to the Attendance')
ax.set_xlabel('Number of Patients') # barh puts the counts on the x-axis
ax.set_ylabel('Neighbourhood of Patients')
ax.legend()
plt.show()
print('We can see that attendance counts vary widely by `Neighbourhood`, which suggests it may affect the patient’s attendance.')
We can see that attendance counts vary widely by `Neighbourhood`, which suggests it may affect the patient’s attendance.
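The previous solution, rewritten as a function
def plotTwoBars(column1, column2, col1_label, col2_label, title, x_label, y_label, typeofbar, xfig_size = 10, yfig_size = 9, color1 = 'C1', color2 = 'b'):
    fig, ax = plt.subplots(figsize = (xfig_size, yfig_size))
    counts1 = column1.value_counts()
    counts2 = column2.value_counts()
    # share one category order (largest combined count first) so the overlaid bars line up with their labels
    order = counts1.add(counts2, fill_value = 0).sort_values(ascending = False).index
    counts1.reindex(order, fill_value = 0).plot(kind = typeofbar, ax = ax, alpha = 0.6, label = col1_label, color = color1)
    counts2.reindex(order, fill_value = 0).plot(kind = typeofbar, ax = ax, alpha = 0.6, label = col2_label, color = color2)
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.legend()
    plt.show()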
col1 = attended.Neighbourhood
col2 = not_attended.Neighbourhood
# calling the function ... (with 'barh' the counts are on the x-axis, so the count label goes first)
plotTwoBars(col1, col2, "Attended", "Not-Attended",
            'Distributions of Neighbourhood counts according to the Attendance', 'Number of Patients',
            'Neighbourhood of Patients', 'barh', xfig_size = 10, yfig_size = 17)
print('We can see that attendance counts vary widely by `Neighbourhood`, which suggests it may affect the patient’s attendance.')
We can see that attendance counts vary widely by `Neighbourhood`, which suggests it may affect the patient’s attendance.
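Raw counts mostly reflect how many patients each neighbourhood has, so a per-neighbourhood no-show rate may be a fairer comparison. The sketch below uses made-up rows; `min_appointments` is a hypothetical threshold to keep very small neighbourhoods from dominating the ranking.

```python
import pandas as pd

# Hypothetical rows: neighbourhood A has 6 appointments, B has 4, C has only 1.
toy = pd.DataFrame({
    'Neighbourhood': ['A'] * 6 + ['B'] * 4 + ['C'],
    'Missed': ['No', 'No', 'No', 'No', 'Yes', 'Yes',
               'No', 'No', 'No', 'Yes',
               'Yes'],
})

# Number of appointments and share of missed ones per neighbourhood.
summary = (
    toy.assign(missed=toy['Missed'].eq('Yes'))
       .groupby('Neighbourhood')['missed']
       .agg(appointments='count', no_show_rate='mean')
)

# Keep only neighbourhoods with enough appointments for the rate to mean something.
min_appointments = 3  # assumed value, chosen only for this example
reliable = summary[summary['appointments'] >= min_appointments]
print(reliable.sort_values('no_show_rate', ascending=False))
```

In the toy data, neighbourhood C has a 100% no-show rate from a single appointment, which is exactly the kind of value the threshold is meant to filter out.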
plt.figure(figsize=(18,20))
ax = sns.scatterplot(data = df, x = 'Age', y = 'Neighbourhood', hue = 'Missed', palette=["teal","yellow"])
col1 = attended.Day
col2 = not_attended.Day
# calling the function ...
plotTwoBars(col1, col2, "Attended", "Not-Attended",
'Distributions of AppointmentDay counts according to the Attendance', 'Days of the Week',
'Number of Appointments', 'bar', xfig_size = 10, yfig_size = 7, color1 = '0.8', color2 = 'r')
As we can see, Tuesday is the most chosen appointment day, both for those who attended and for those who did not. Saturday, on the other hand, is the least chosen day.
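The busiest day can also be read from a table instead of from the plot, which avoids any doubt about which bar belongs to which label. A sketch on made-up rows; `week_order` is my own list, and it leaves out Sunday on the assumption that no appointments fall on that day.

```python
import pandas as pd

# Hypothetical rows with the same Day and Missed columns as the dataset.
toy = pd.DataFrame({
    'Day': ['Tuesday', 'Wednesday', 'Wednesday', 'Tuesday', 'Monday', 'Wednesday'],
    'Missed': ['No', 'No', 'Yes', 'Yes', 'No', 'No'],
})

# Count appointments per day and outcome, in calendar order; days without data get 0.
week_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
by_day = pd.crosstab(toy['Day'], toy['Missed']).reindex(week_order, fill_value=0)

# idxmax returns the first label with the highest count.
busiest_attended = by_day['No'].idxmax()
busiest_missed = by_day['Yes'].idxmax()
print(by_day)
print(busiest_attended, busiest_missed)
```

In this toy example the busiest day differs between the two groups, which is worth checking on the real data as well.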
# dia_val = df[df['Diabetes'] == 1].Diabetes.value_counts()
# no_dia_val = df[df['Diabetes'] == 0].Diabetes.value_counts()
# # print(dia_val)
# # print(no_dia_val)
# han_val = df[df['Handicap'] != 0].Handicap.value_counts() # since we have more values other than 1 like 2, 3, 4
# no_han_val = df[df['Handicap'] == 0].Handicap.value_counts()
# # print(hyp_val)
# # print(no_hyp_val)
max_age = df.Age.max()
print("The Oldest age recorded in the database is '{}'".format(max_age))
The Oldest age recorded in the database is '115'
mode_val = df.Age.mode()
median_val = df.Age.median()
mean_val = df.Age.mean()
print("The Mode value of the Age is '{}'".format(mode_val[0]))
print("The Median value of the Age is '{}'".format(int(median_val)))
print("The Mean value of the Age is '{}'".format(int(mean_val)))
The Mode value of the Age is '0'
The Median value of the Age is '37'
The Mean value of the Age is '37'
var_val = df.Age.var()
std_val = df.Age.std()
print("The Variance value of the Age is '{:.3f}'".format(var_val))
print("The Standard Deviation value of the Age is '{:.3f}'".format(std_val))
The Variance value of the Age is '534.080'
The Standard Deviation value of the Age is '23.110'
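The two numbers above are tied together: the standard deviation is the square root of the variance. The sketch below checks that on a few made-up ages, and also shows that pandas and NumPy use different default denominators.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, only to illustrate the relationship.
ages = pd.Series([0, 18, 37, 55, 115])

# pandas defaults to the sample estimate (ddof=1, divide by n - 1).
sample_var = ages.var()
sample_std = ages.std()

# NumPy defaults to the population estimate (ddof=0, divide by n).
population_std = np.std(ages.to_numpy())

print('{:.3f} {:.3f} {:.3f}'.format(sample_var, sample_std, population_std))
```

On 110527 rows the difference between the two denominators is negligible, but it can matter when comparing small groups.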
least_five_neigh = df.Neighbourhood.value_counts().nsmallest(5).to_dict()
print("The 5 Neighbourhoods with the fewest Patients are:")
print(least_five_neigh)
# for place, patient_counts in least_five_neigh.items():
#     print("'{}' \t\t\t '{}'".format(place, patient_counts))
The 5 Neighbourhoods with the fewest Patients are:
{'PARQUE INDUSTRIAL': 1, 'ILHAS OCEÂNICAS DE TRINDADE': 2, 'AEROPORTO': 8, 'ILHA DO FRADE': 10, 'ILHA DO BOI': 35}
The ratio between all patients who attended and those who did not attend is 3.95 (A/N).
There are no duplicated rows in the dataset.
PatientId has 48228 duplicates while AppointmentID has none, because a patient can register for several appointments with the same PatientId.
Handicap values range from 0 to 4, according to the number of disabilities the patient has.
1) Does Gender affect the attendance?
The ratio for men was 4.01 (A/N).
The ratio for women was 3.92 (A/N).
Women miss their appointments slightly more often than men.
2) Does Alcoholism affect the attendance?
Alcoholism: 3.96 (A/N)
No Alcoholism: 3.95 (A/N)
There is no noticeable effect of Alcoholism on the attendance.
3) Does receiving an SMS affect the attendance?
Receiving an SMS does not improve the attendance in this data: the ratio is 2.63 (A/N) for patients who received an SMS and 4.99 (A/N) for those who did not.
4) What is the distribution of Ages according to the attendance?
Age appears to affect the attendance around 10 years and in the range between 45 and 57 years.
Outside these ranges of Age we can conclude that there is no significant effect of Age on the attendance.
5) Does the Neighbourhood affect the attendance?
Attendance counts vary widely by Neighbourhood, which suggests it may affect the patient’s attendance.
6) How do Age and Neighbourhood affect the attendance together?
The scatter plot shows Age and Neighbourhood together and how they relate to the patient’s attendance.
7) What is the most and the least chosen AppointmentDay in the week according to the attendance?
Tuesday is the most chosen appointment day, both for those who attended and for those who did not. Saturday, on the other hand, is the least chosen day.
8) What is the oldest Age recorded in the database?
115
9) What is the central tendency of the Age?
Mode: 0, Median: 37, Mean: 37
10) What is the measure of spread for the Age?
Variance: 534.080, Standard Deviation: 23.110
11) Which 5 Neighbourhoods have the fewest patients?
The 5 Neighbourhoods with the fewest patients are:
{PARQUE INDUSTRIAL: 1, ILHAS OCEÂNICAS DE TRINDADE: 2, AEROPORTO: 8, ILHA DO FRADE: 10, ILHA DO BOI: 35}
More investigation and statistical analysis of more columns is needed in the future. Due to the shortage of time for this project I could not investigate and present more, although it would have been very interesting to do so.
Some of my investigations showed no significant effect, or not the expected effect, of the chosen independent variable (such as Alcoholism and SMS) on the dependent variable, which was Attendance.
Almost all of the data are categorical, so I could not use more types of plots such as jointplot, violinplot, swarmplot, etc.
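One possible next step for the statistical analysis mentioned above is a chi-square test of independence between a categorical column and attendance. The sketch below computes only the test statistic with NumPy, on made-up counts. It is an illustration of the idea and was not part of this analysis.

```python
import numpy as np


def chi_square_statistic(observed):
    """Pearson chi-square statistic for a contingency table of counts."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Counts expected in each cell if rows and columns were independent.
    expected = row_totals @ col_totals / observed.sum()
    return ((observed - expected) ** 2 / expected).sum()


# Hypothetical 2x2 table: rows are two groups, columns are attended / missed.
table = [[30, 10],
         [20, 20]]
stat = chi_square_statistic(table)
print('{:.3f}'.format(stat))
```

For a 2x2 table there is 1 degree of freedom, and a statistic above about 3.84 would be significant at the 5% level. With very large samples even tiny differences can pass that threshold, so the size of the effect still needs to be looked at separately.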